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Abstract 

Structured sparsity has recently emerged in statistics, machine learning and signal process¬ 
ing as a promising paradigm for learning in high-dimensional settings. All existing methods 
for learning under the assumption of structured sparsity rely on prior knowledge on how to 
weight (or how to penalize) individual subsets of variables during the subset selection process, 
which is not available in general. Inferring group weights from data is a key open research 
problem in structured sparsity. In this paper, we propose a Bayesian approach to the problem 
of group weight learning. We model the group weights as hyperparameters of heavy-tailed 
priors on groups of variables and derive an approximate inference scheme to infer these hyper¬ 
parameters. We empirically show that we are able to recover the model hyperparameters when 
the data are generated from the model, and we demonstrate the utility of learning weights in 
synthetic and real denoising problems. 


1 Introduction 


High-dimensional prediction problems are more and more common in many application domains 
such as computational biology, signal processing, computer vision or natural language processing. 
To handle this high-dimensionality, one usually resorts to linear modeling and regularization with 
sparsity-inducing norms, such as the £i norm. This type of regularization results in sparse mod¬ 
els, meaning that the model is described by relatively few parameters. Besides making parameter 
learning consistent in high-dimensional settings, the sparsity assumption has the appealing prop¬ 
erty of yielding more interpretable models. As an example, consider the problem of explaining 
a particular phenotype of patients, e.g., the disease state, based on the genome sequence of each 
patient. Sparse linear approaches try to find a handful of genetic loci that govern the disease state, 
rather than a model involving the whole sequence. The £i-regularized sparse linear models, such 
as the LASSO (Tibshirani, 19941 or basis pursuit ( |Chen et al.\ 1999), are well studied by now, with 
a solid body of theoretical results, efficient algorithms and applications in diverse fields (see, e.g., 
Biihlmann and van de Geer, 2011 and references therein). However, in practice, we often know 


that there is more structure in the problem at hand, which cannot be captured by simple sparse 
modeling and £i regularization, and which, if exploited, can improve the estimation of parameters 
as well as the interpretability of the estimates (see Cevher et al. 2008 Huang et al. 2011 Bach et 


al. 2012b, and references therein). In our example, we could expect the genetic loci that influence 


the disease to be part of a small number of connected patterns in a known gene-gene interaction 
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Figure 1: The coefhcient vector w is covered by latent variables supported on subsets A, B and C: 
w = va+vb A VC- 


network (Rapaport et al. 2007 Azencott et al. [2013 ). In other words, we could be looking for 


a small number of possibly overlapping subsets of variables such that each subset corresponds to 
a connected subgraph in a given gene network, and the combination of variables in each subset 
influences the phenotype. 

Given prior knowledge about the relevance of each considered group of variables, several meth¬ 
ods exist for learning sparse models guided by this prior knowledge. These methods achieve 
different kinds of structured sparsity by regularization (penalization, weighting) with appropriate 
sparsity-inducing norms, that often correspond to convex relaxations of combinatorial penalties on 
the support (i.e., the set of indices of non-zero components) of the parameter vector. After the 
group LASSO (Yuan and Lin, 2006), a number of convex penalties have been proposed, generalizing 


the group LASSO penalty to the cases of overlapping groups (Zhao et al. 2009 Jacob et al. 


Jenatton et al. 


Jenatton et al. 


2011a 


2011b). See (Bach et al. 


Chen et al.: 2012), including tree-structured groups (Kim and Xing 


2009 


2010 


2012a|b| for a more detailed review of sparsity-inducing 


norms. 


While most of these norms induce intersection-closed sets of non-zero patterns, [Jacob et al.\ 


(20091 and Obozinski and Bach (2012) introduce a different, latent formulation of sparsity-inducing 


norms that yields union-closed sets of non-zero patterns, meaning that the parameter vector w is 
represented as a sum of latent vectors va, identically zero at indices not in A for a subset A of 
indices. If several such sets of indices are considered, then the support of w (i.e., the set of indices 
i for which wt is non-zero) is included in the union of such sets (see Figure for illustration with 
three sets A, B and C). 

In order to quantify the intuition above, Obozinski and Bachj ( |2012[ ) consider the following 
function on the support supp(ry) of w. 


g{snpp{w)) = 


mm 
A'QA, 


A&A' 


fiA), 


( 1 ) 


that is, g(supp(r(;)) is the minimum-weight cover of supp(u') with the subsets A in the family A. 
The weights f{A) express our prior belief in the subset A being relevant: If a group A is irrelevant, 
then f{A) = oo. Using the function g as a regularizer (essentially the approach of Huang et al. 
(20111) will encourage the support of the parameter vector w to be a union of subsets A £ A with 
finite /(A). 


Moreover, Obozinski and Bach (2012) computed a convex relaxation of the function g defined 


above, leading to the following norm il(w) equal to: 

hAhfiAy^^ s.t. 


Va = 'W’ 


( 2 ) 


However, generally we do not have this prior knowledge about the relevance of individual groups: 
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The problem of automatically choosing appropriate weights for groups of variables, f{A), is an 
important open research problem in structured sparsity. Assuming that we have several learning 
problems with similar structure (the relevance of a given group is largely shared across individual 
problems), in this paper we propose a framework for learning group relevances from data. Note 
that learning the structure is naturally a multi-task problem, as it is impossible to estimate the 
prior on a vector of parameters if we only observe one particular instance of it. To come back to our 
example, we could assume that we have several phenotypes that can be explained by groups of loci 
whose relevance is largely shared across phenotypes. A recent approach to learning group relevances 
from data has been proposed by Hernandez-Lobato and Hernandez-Lobato (20131. However, this 
work only considers learning relevances of pairs of variables and does not make the link with 
sparsity-inducing norms. Let us also mention that probabilistic modeling for structured sparsity 


has also been explored by Marlin and Murphy (2009) and 


Marlin et al. (2009) in the context of 


(2014) for multi-task learning with structure 


learning Gaussian graphical models, and by Han et al. 
on tasks. 

We approach the problem using probabilistic modeling with a broad family of heavy-tailed 
priors and derive a variational inference scheme to learn the parameters of these priors. Our model 


follows the pattern of sparse Bayesian models (Palmer et al. 2006 Seeger and Nickisch 2011 


among others), that we take two steps further: First, we propose a more general formulation, 
suitable for structured sparsity with any family of groups; Second, we learn the prior parameters 
from data. We show that prior parameter estimation with classical variational inference does not 
always lead to reasonable estimates in these models, and find a way of regularizing that works well 
in practice. Moreover, we propose a greedy algorithm that makes this inference scalable to settings 
in which the number of groups to consider is large. In our experiments, we show that we are able 
to recover the model parameters when the data are generated from the model, and we demonstrate 
the utility of learning penalties in image denoising. 


2 A Probabilistic Model for Structured Sparse Linear Re¬ 
gression 

In this section, we formally describe our model and a suitable approximate inference scheme. 

2.1 Model definition 

I X P 

We consider K linear regression problems with design matrices X € K and response vectors 
yk g for /c £ {ij... j K}. For each and we assume the classical Gaussian linear model 
with i.i.d. noise with variance cr^, that is. 

Let V be the set of indices of variables {1,..., P}. For a family A of subsets of H, we assume 

< ( 4 ) 

AeA 


where, for each fc, 

• VA G A, v\ is a vector in such that all its components with indices in H \ A are zero (in 
other words, it is supported on A), 

• are jointly independent, and 

• VA G A,v\ has an isotropic density with inverse scale parameter /(A) 

p{v\\f{A)) = qA{\\v\hf{Af/^)f{Ar\/\ (5) 


where qA is a heavy-tailed distribution that only depends on A through its cardinality, |A|. 


We specify qA in Section 2.2 
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We regard the inverse scale parameter f{A) as a measure of relevance of the group of vari¬ 
ables ■€1 If a group of variables is irrelevant, then f{A) should equal infinity. We are interested 
in priors qa such that for each task indexed by k only a handful of v\ can be significantly away 
from zero. 

Here it is important to stress the link between the expression of our isotropic prior ([^ and 
the norm Q{w) (U from [Obozinski and Bach| ( |2012[ ), introduced above: The log-likelihood of 
parameter vectors {w’‘}k=i,...^K with respect to / will (up to a constant) be equal to the term 

which very closely resembles the norm ([^. If qa is the generalized 
Gaussian distribution (cf. Section 2.4), the two expressions match exactly. Thus, learning with 
our prior is a natural probabilistic counterpart of learning with the sparsity-inducing norm Q . 

Given data and such a model for the prior, our goal will be to infer the 

parameters f{A) by maximizing the likelihood with respect to /, 


K 


P{y\...,y^\f) = l[ H p{v'X\f{A))dv\, 


( 6 ) 


k=l • 


AgA 


where the parameters are marginalized. 


2.2 Super-Gaussian priors 

We assume that qa is a scale mixture of Gaussians, i.e.. 


poo 

QAiu) = / J\f{u\0,s)rA{s)ds 

Jo 


for some mixing density r^(s). The main reason why we choose to work with the family of 
scale mixtures of zero-mean Gaussians is that it contains distributions that are heavy-tailed and 
therefore suitable for modeling sparsity; One such distribution is Student’s t which we use in our 
experiments. The inverse scale parameter of the distribution on f{A), captures the relevance 
of the group A: the smaller f{A), the more relevant the group, that is, the larger the values v\ is 
likely to take. Note that even if the group A is relevant, not all v\, k = 1,..., K have to be large. 
In fact, if the parameters v\,k = I,..., i^T are drawn from a heavy-tailed distribution with small 
/(H), then only a fraction of them will be significantly away from zero. Moreover, as we show 
in Section |2.3[ learning in such models is amenable to variational optimization with closed-form 
updates and leads to an approximate Gaussian posterior on v\. 

In general, the integral in Q is intractable for Gaussian scale mixtures, therefore one has to 
resort to sampling or approximate inference to learn parameters in such models. The fact that qa 
is a Gaussian scale mixture implies that it is also super-Gaussian, that is, the logarithm of qaIu) 
is convex in and non-increasing |Palmer et al. 20061 ^ 
the following form by convex conjugacy 


It therefore admits a representation of 


log QAiu) = sup--- (/>a(s), 

s>0 


(7) 


where (fAis) is convex in I/s. Note that the expression inside the supremum in Q has a unique 
maximizer. In this work we only consider qa for which this maximizer has an analytical simple 
form. From (§ and 0> we get the following variational representation for p{v\\f{A)): 


p{v’X\f{A)) = f{A) 


lAi 


sup exp 
Cl>o 


( 


4111 


fA) 


24 

4/ 


- MO) 


4 


k 


= /(H) '^'^supJaa(4|0, jf^){2n-M) 


( 8 ) 


’/(7l)A /(H). 


o-MCl) 


^Abusing notation, we will call “group A” the subset of variables indexed by elements of A throughout the paper. 
^Note that the converse is not true: complete monotonicity of t he log-density is a necessary and sufficient 
condition for the existence of a Gaussian scale mixture representation (Palmer et al. 2006 Section 3). 
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Figure 2: The graphical representation of our model. 


For a particular choice of the prior qa, we measure the relevance of the group of variables A by 
the expectation of (which amounts to the sum of the variances of the individual components 

ofv’X), 


E [11411^] 


Ell 


fiA) 


where E|| 2 || 2 ..^q^ [IkHi] is the expectation of ||z ||2 under the standardized distribution qa on \\z\\ 2 . 
In fact, as we have 

E[lk'=||^] = E®[II^Alli] 

AeA 


given our independence assumption, the expected value of ||u^||| allows us to measure the con¬ 
tribution of the group A with respect to E [Hui^Hl]. We somewhat abusively call Ethe 
signal variance in our experiments, as opposed to PcP, the noise variance. Figure ^ represents 
the graphical model corresponding to our assumptions. Note that we have explicitly incorporated 
the variational parameter into the graphical model: In fact, the same parameter can also be 
interpreted as the scale paramet er of the Gaussian in the Gaussian scale mixture representation of 

Piv'XlfiA)) 


(Palmer et al.. 2006 


2.3 Inference 

Our model described above, namely the combination of the density of ([^ and the variational 
representation of the prior density on v a (|^, leads to the following variational bound on the 
marginal distribution of 

logp(/|/) 

= log f p[y^\X'^w\a^I) n p{v\\f{A))dv\ 

'' AeA 

> sup hogM{y'^\0,X'^MZ'^F-^M^X’^^ +a^I) 

Cl>o ^ 

AeA 

where M is a matrix of dimension P x J^AeA 1^1 ensures = Mv^ where is the con¬ 
catenation of all elements indexed by elements of A in , ^4 G A, and F and are square 
diagonal matrices of size Yl,AeA 1^1 whose diagonals consist of f{A) and (a respectively, replicated 
|A| times, for each A ^ A. Thus, as an approximation to minimizing the negative log-likelihood, 
we would like to minimize the following overall bound with respect to / and C^\ for all kl G .4 and 
fcG {!,...,IF}: 


-El 


- \^y'^^{x^MZ^F-^M^X ^^logdet cr^l) 

+ E ^ log(2^) + ^ logdet(Z'=F-i) - MCa)}- 


(9) 
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In its form given by (§, the bound is difficult to optimize. However, we recognize parts of it as 
minima of convex functions, which allows us to design an iterative algorithm with analytic updates, 
finding a local minimum (see the appendix for details). Our optimization problem becomes 


K 


inf inf inf ^ |^||/- ^ ||2 + TrE^J - J logdetS'^ 

A^A 

J\jk pjk 1 T 1 

+ — log(a2) + — log(27r) + — TrM^X^^ X'=ME'= " ^ E 


k^l 


+ E [“ - ^log27r + (/>A(Ci)]}, 


AeA 


AgA 


( 10 ) 


and the closed-form updates are 

E^' = X'^M + 

X^M+ a‘^FZ^~^)-^M^X'^^y^ 

cl = argmin(/)^(z) -f ^+ Tr E^^) 

z>0 ^ Z 


f{A) 


K\A\ 


EtEdkllli+TrE^j’ 


iterated until convergence. 


( 11 ) 


Remark 2.1 Note that the only update that depends on the specific prior distribution is that for 
the variational parameter C,\, all others apply to all super-Gaussian priors. 


Remark 2.2 It can he shown that the updates (111 exactly correspond to the updates yielded by 
mean-field variational inference in the special case of Gaussian scale mixtures (Palmer et al. 2006). 
However, the approach presented here is more general, as it also applies to super-Gaussian priors 
that are not Gaussian scale mixtures. 


Remark 2.3 Using the matrix inversion lemma, the update for Yf can he rewritten in such a 
way that we avoid the expensive inversion of a Eag.4 I^I ^ E.4G.4 1^1 '^no.trix and we only have 
to invert a P x P or x matrix instead, which can even he diagonal in certain cases (see 
the appendix for details). When it is not diagonal, matrix inversions can be avoided by making an 
extra diagonal assumption on the covariance matrix of the Gaussian posteriors of all v\. 

Remark 2.4 While we do provide an update equation for for completeness, in general it is 
customary to assume the noise level known, which we also do in all our experiments. 


2.4 Special cases 

The family of super-Gaussian distributions includes Student’s t and generalized Gaussian dis¬ 
tributions among many others. We here give the densities of these distributions, as well as the 
expressions for the quantities in our model and inference that depend on the particular prior on v\. 


Student’s t: The density of this distribution is given by 


p(v\\a,f{A)) 




12^(4)' 


.lAi 


( 12 ) 
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where a is a parameter governing the shape of the distribution. The smaller a, the heavier-tailed 
the distribution (for a < 1, there is no finite variance). For this distribution, 


4>a{Ca) —yk + + 1/2) log(Cl) + y log(27r) — (a -I- \A\/2) -b (a -b \A\/2) log(a -b \ A\/2.) 

a 2 (13) 

- log(r(a -b 1^1/2)) -b log(r(a)), 


and, therefore, the update for is written as 




l + i/(^)(lkllll+TrSlJ 


M 

2 


(14) 


The variance of a Student’s t-distributed random variable, if a > 1, is 

and therefore E(||r!^|| 2 ) = ■ Student’s t has a natural representation as a Gaussian scale 

mixture with the inverse Gamma as the mixing distribution. All our experiments are carried out 
using Student’s t. 


Generalized Gaussian: The density is given by 


p(u^| 7 ,/(A)) = /(A)^ 


ipi'tdii , 1 

2^1 2 1 „-|hlf(A)2||? 

TT 2 r(J-yt) 


(Pascal et al. 2013). Gonsequently, we have 


MCX) 


log 


ir(^) 

7 r^r(M) 


-b 


/-k 2-~, 


(^- 1 ) 



and E(||u^|| 2 ) 


_ ^(|A|/7-^2/7) 
- /(A)r(|A|/7) • 


|/(A)(lkllli+TrS^j7-2 x^ 

(1/7-1/2)7^ 7 ^ ’ 


(15) 


(16) 


(17) 


2.5 Learning with all groups 

While our model and the associated inference algorithm described earlier are valid for any set of 
groups A, including A = 2^, the algorithm is impractical when A is large: Indeed, even if we only 
have 20 variables and 1000 tasks, learning with A = the number of variational 

parameters Ca will exceed a billion. To avoid working with a prohibitively large number of groups 
at once, one can leverage an active set-type heuristic that maintains a list of relevant groups and 
iteratively updates it. Algorithmwhich we discuss in detail in the following, describes one way 
to do this. It requires setting the maximal allowed cardinality T of A, and the number D of groups 
to be discarded in each active set update. We start by learning with singletons only (steps 1 and 
2 ); After ranking the groups in A according to their relevance measured by into the sequence 
(Al,..., A|^|) (step 3), we determine the additional groups to be considered. A', by taking the 
first T sets from the sequence (Ai U A 2 ,..., Ai U A|^ |, A 2 U A 3 ,..., A 2 U A|^I,...), ignoring groups 
that have been considered in the past and making sure we do not add the same group more than 
once; In steps 5-11 we repeatedly (a) learn with AUA', (b) rank the groups, (c) update A and A^ 
In step 8 we choose not to discard the singletons just to make sure that A always covers {!,..., P}. 
The stopping criterion (step 5) may be that we have no more groups to consider (if P is small 
enough), or that we have reached a predehned maximal number of iterations. 
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Algorithm 1 Active set procedure for the discovery of relevant groups 
Parameters: T S N, I? S N 
1 : Let A = {1,..., P} and V = % 

2 : / •(— variational(A) 

3: Rank all A G A according to their relevance 
4 : Determine Al 

(make sure \AiJ A'\<T,A! fl{AiJV) = %) 

5: while stopping condition not met do 
6 : / ^ variational(AU A') 

7: Rank As AUA^ according to their relevance 

8 : Add to V the D least relevant non-singletons in A U A^ 

9: A i — A U A \ P 

10 : Determine A' 

(make sure |AU A'| <T, A'n(AUP) =0) 

11: end while 


3 Approximation Quality and Regularization 


The goal of this section is to experimentally study the behavior of our approximate inference 
scheme in terms of estimation quality and to clarify how we can control it. As we empirically show 
below, the variational approximation scheme from Section |2.3| tends to overestimate the variance 
of the prior distribution (i.e., underestimate the inverse scale parameter /(A)) when this variance 
is smaller than the noise variance. This is undesirable, as we would like /(A) to tend to infinity 
for irrelevant groups of variables. To circumvent this problem, we use an improper hyperprior of 
the formp(/(A)) oc /(A)^ to encourage /(A) to go to infinity when the variance of p(u^) is smaller 
than cr^. Consequently, the regularization term with /? > 0 is added to the 

objective function ([To|, and the only update that changes is that for /(A): 


f{A) = 


1 

2 


’A 


(18) 


Thus, we substitute the approximate type-II maximum likelihood estimation of /(A) 
imate (also “type-II”) maximum a posteriori estimation. In Sections 3.1 and |3.2| we 
study the effect of the parameter /? on the approximation quality. 


by approx- 
empirically 


3.1 Scale parameter inference with only one variable 

In this experiment, we evaluate the performance of the variational method described in Section [2.3| 
in recovering the unknown scale parameter / of the prior in the simplest, 1 -dimensional case (note 
that in this subsection we omit the subscripts AasA={{l}}). More specifically, our goal here is 
to answer the following questions: Given an i.i.d. sample drawn from a univariate Student’s t with 
shape and inverse scale parameters a and /, corrupted by Gaussian noise, and supposing we know 
both the noise variance tr^ and the shape parameter a, can we precisely estimate the inverse scale 
parameter / using the variational method from Section [2.3p ’ In the settings where we cannot, does 
regularization improve our estimates? 

Experimental setup. We consider 10,000 tasks with one variable and one observation each (P, 
for all k, and for all k equal to 1). Data are generated from the model with Student’s t 
prior on with parameters a set to 1.5 and / varying in the set T of 14 values between 0.02 
and 50 taken roughly uniformly on the logarithmic scale, and Gaussian noise with variance cr^ 
set to 1. We compare the performance of the variational method with that of a grid search over 
J- U {10®}, where we use the trapezoidal rule to numerically solve the intractable integral in 
The grid search, feasible in this basic setting, provides the best available approximation to the 
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p=o 


p = 0.05 


P = 0.25 




0.1 1 10 100 

variance(/) 



0.1 1 10 100 

variance(/) 


0.1 1 10 100 

variance(/) 


.true f 

- / grid search 

- / variational 


Figure 3: Recovery of the variance of the univariate Student’s t distribution with added Gaussian 
noise of known variance with grid search and the variational method, with different levels of 
regularization. The x and y axes represent the variance based on the true and on the estimated / 
parameter values, respectively. 


regularized maximum likelihood solution. To reduce the effect of random fluctuations, we repeat 
all experiments 5 times with different random seeds and report averaged results. 

Results. Figure [^summarizes the results. For three values of the parameter /3, we plot (on the 
logarithmic scale) the estimated against the true variance for the considered range of the parameter 
/ (recall that the variance of a Student’s t-distributed random variable with parameters a and / 
equals (yryjj). In all figures, we also plot the variance of the Gaussian noise cr^. We observe that 
in the absence of regularization (/3 = 0) and when the signal is not much stronger than noise, 
the variational method overestimates the signal variance while the grid search does not. As we 
add regularization, this effect gradually goes away and the signal variance estimate is set to 0 
(i.e., the estimate of /, /, goes to infinity) if the true signal variance is smaller than a certain 
threshold. When the regularization is too strong (/3 = 0.25), the estimated signal variance drops 
to 0 even when the signal is stronger than the noise, and the variance of the signal is heavily 
underestimated. With the right amount of regularization (/3 = 0.05 in this case) we observe the 
desired behavior: The variational method recovers the signal when it is stronger than noise, and 
sets / to infinity otherwise. In all cases, variational estimates are close to the maximum likelihood 
estimates obtained by the grid search when the signal is much stronger than the noise. 

3.2 Structured sparsity with two variables 

In this section, we empirically study the most basic case of the group relevance learning prob¬ 
lem. Suppose that in each task we only have 2 variables, and therefore 3 possible groups, 
A = {{!}, {2}, {1,2}}. Let be the identity matrix in each task. In this basic setting, and 
supposing that the data come from the model, can our inference algorithm distinguish the case 
where the data {y^}k=i,...,K are generated by the group of variables {1, 2} from the opposite case, 
where the relevant groups are the two singletons {1} and {2}? 

These two settings differ in fact significantly in the case of a heavy-tailed prior on We have 
“*"^{2} +^{i 2}’ relevant and {1} and {2} are not, then and will have 

to be close to zero for all k, however, significantly far from zero for some k. As the 

prior on v\ only depends on va through its norm, these 2} can be anywhere on the circle with 
radius 2} II2 with the same probability and therefore can also be anywhere on the circle with 
radius ||j/^||2- In contrast, when {1, 2} is irrelevant and {1} and {2} are relevant, the rare events of 
and vy 2 } both being significantly away from zero will not occur at the same time for most k, 
and therefore the y^ with a large norm will tend to be concentrated along the axes. This behavior 
(using Student’s t prior with parameter a = 1.5 on v\) is illustrated in Figure [^ where we have 
plotted the data {y^}k=i,...,K for K = 5,000 in both settings. 
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Singletons dominate Pair dominates 



Figure 4: On the left, the singletons are the relevant groups. On the right, the pair is the relevant 
group. 


Ideal configuration beta = 0 



0.080.5 1 2 4 8 50 0.080.5 1 2 4 8 50 

Group Group 


beta = 0.03 



0.080.5 1 2 4 8 50 
Group 


beta = 0.15 



0.080.5 1 2 4 8 50 
Group 


Figure 5: A red (blue) square means that the estimate of the singleton (group) variance is larger 
than the estimate of the group (singleton) variance for the corresponding true singleton and pair 
variances indicated by the axes. A black square means that both singleton and pair variances are 
under 2 ct^, the noise variance. Best seen in color. 


Experimental setup. We consider 5,000 tasks with P and for all k equal to 2, with the set 
of groups A = {{!}, {2}, {1, 2}}. The data are generated from the model with Student’s t prior 
on with parameters a set to 1.5 and each f{A) varying in a set of 14 values between 0.01 and 
25 taken roughly uniformly on the logarithmic scale (/({I}) and /({2}) always equal each other), 
and Gaussian noise with variance cr^ set to 1. 

Results. Figure summarizes the results for three values of the regularization parameter (3 
{(3 = 0 corresponds to the absence of regularization). We report when the estimated pair vari¬ 
ance — 1 ) j({i dominates (blue) or is dominated (red) by the estimated singleton variance 

77 provided that one of them is larger than the noise variance, 2 ct^. We see 
that when we do not regularize, the variational method explains everything with the singletons. 
As we add regularization, the pair explains more and more variance, however in such a way that 
the pair also explains the signal coming from singletons. Nonetheless, there is a regime (/3 = 0.03) 
where a strong signal coming from both the singletons and the pair is identified correctly. If we 
regularize too strongly (/? = 0.15), the entire signal is explained by the pair, regardless of its source. 


4 Experiments 

In our experiments we consider two different instances of the denoising problem and we empirically 
evaluate the performance of our approach in recovering both the signal and the structure. 

4.1 Structured sparsity in the context of denoising 

In this section, we study toy multi-task structured sparse denoising problems. Our goal is to answer 
the following questions: Given data {y^}k=i,...,K: generated from the model, and assuming that we 
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Singletons 

One group 

Overlapping 

LASSO-like 

18.5±0.3 

18.6±0.4 

58.4±1.1 

W. LASSO-like 

14.5T0.3 

14.5±0.3 

42.8T0.9 

Structured 

14.8±0.3 

13.8T0.3 

43.0T0.9 

Structured(AS) 

14.6±0.3 

14.0±0.3 

42.8±0.9 


Table 1: Squared error averaged over the tasks with 95%-confidence error bars for each combination 
of data generation and learning models. The usage of boldface indicates that the corresponding 
method significantly outperforms the others, as measured using a t-test at the level 0.05. 


know the true shape parameter a of the Student’s t and the noise variance cr^, (a) can we recover 
the structure (i.e., the relevant groups and their weights), and (b) if we use the correct structure, 
is our denoising more accurate than when using a different structure? 


Experimental setup. To this end, we consider 10,000 tasks with P and for all k equal to 10, 
with the set of groups A = {{Q}q=i,...,p, {1,. .., Q}q=2,...,p}- Each signal is generated using 
Student’s t with parameters a set to 1.5 and f{A) set to 0.2 or to 200, depending on whether A is 
considered relevant or irrelevant: In this fashion, the variance of the signal coming from relevant A 
is = 10 X 1^1 (respectively, 0.01 x |A| for irrelevant A). For each task k, is a perturbed 

version of the signal with additive Gaussian noise of variance 
We consider three different ways of generating data: 

• Singletons: Here, only {1},..., {5} are relevant, all other groups in A are irrelevant. 

• One group: Only {1,2,3,4, 5} is relevant. 

• Overlapping groups: The groups {1}, {1,2}, ... , {1,2, 3,4, 5} are relevant. 

For the three cases, we choose cr^ so that the total noise variance Pa^ equals the total signal 
variance in each case. 

We consider four models of increasing complexity for inference: 


LASSO-like: In this simplest model, we only use the singletons, A = {{1}, •. •, {T*}}, and 
moreover, we force f{A) to be constant across A; In order to do so, we change the update 


for f{A) to f{A) = T 


KT.a^aA+^) 


5 ELi EAe.4 Wdhllli+TrS^^) 


. This mimics the behavior of the LASSO, 


as the prior (that we are learning here) is the same for each coefficient. 
Weighted LASSO-like: The usual model with A = {{1},..., {T*}}. 


• Structured: The usual model with A= {{(5}q=i,....p, {!,• ■ • ,Q}q= 2 ,...,p}- 

• Structured (active set): The model where we also learn A using Algorithm (with 
parameters T = 4P, D = 2P, and 5 active set updates). 


We examine each of the 12 combinations of data generation and learning models. In each case, we 
use half of the tasks to find the optimal /3 in terms of the mean squared prediction error (i.e., the 
mean squared difference between the true and the learned signals w^) from a predefined range of 
7 values, and the other half to learn with this (3 and evaluate the test error. 


Results. We begin by examining the performance of each of the four models in signal recovery. 
In Table we report the mean squared error on the 5,000 test tasks with 95%-confidence error 
bars. For all three regimes for data generation, the LASSO-like model performs far worse than 
the three others in recovery. This is due to the fact that this model learns the same prior for all 
variables, although not all variables have the same marginal variance. In the first and third data 
generation regimes W.LASSO performs slightly better than Structured in signal recovery, while 
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One group, Structured 


Overlapping, Structured 


1 

2 

3 



0 0.00010.001 0.01 0.02 0.05 0.1 

fi 


Figure 6: For each group of variables on the y 
of total explained variance per /?. 



0 0.00010.001 0.01 0.02 0.05 0.1 

a 

axis, the intensity of gray indicates the percentage 


Structured has an advantage when a single group is relevant. The performance of Structured(AS) 
is systematically close to, or on a par with that of the best-performing model. 

In terms of structure recovery^ for all three data generation regimes, we find one or more values 
of 13 that lead to the recovery of the relevant groups by Structured and Structured(AS), with 
either the same or a slightly different /3 value leading to the smallest error in signal recovery. 
Figure illustrates the percentage of total explained signal variance by each group for the One 
group and Overlapping regimes and for the Structured model, for all considered regularization 
parameters: With no regularization, the model explains the signal with both the relevant group(s) 
and the singletons included in the relevant group(s), however with more and more regularization, 
the signal variance explained by smaller groups is taken over by larger ones. The groups containing 
elements from {6 ,..., 10}, not shown in the plot, explain no variance in no regularization regime, 
with the exception of the largest group {!,..., 10} that explains the weak signal coming from the 
irrelevant groups (recall that we have non-zero signal variance 0.01 x |A| for the irrelevant groups 
A) in weak and moderate regularization regimes and takes over the whole signal variance when 
the regularization is too strong. 

In summary, the performance in denoising does not change drastically depending on the amount 
of regularization, unless it is too strong; However, a small amount of regularization is likely to 
better capture the structure than no regularization; If there is a strong group structure among the 
variables, regularization may also lead to better recovery. A formal criterion to set the value of 
the hyperparameter f3 would be to maximize its likelihood, as is customary in Bayesian methods. 


4.2 Image denoising with wavelets 

In this section, we consider the image denoising problem using wavelets. The Haar wavelet basis 


for 2-dimensional images (Mallat, 1998) can naturally be arranged in three rooted directed quad¬ 


trees, which can be connected to form one tree by attaching the three roots to an artificial parent 
node; The structured sparsity-inducing norms with non-zero groups that are paths from the root 
in this tree have shown improvements over the £1 norm (Jenatton et al. 2011bI. Our goal is to find 


out whether, in this task, (a) a value of /3 that leads to good recovery for a set of images is also 
close to optimal for another set of images of roughly the same size, at least when the noise level 
is unchanged (stability of the hyperparameter); (b) learning a non-uniform prior on singletons 
improves recovery with respect to using a uniform prior (importance of learning a non-uniform 
prior); (c) learning the group structure helps beyond learning a non-uniform prior on singletons 
(importance of learning group relevances). 
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Figure 7: The images used in our experiments (Barbara, Fingerprint, Lena, House). 


Experimental setup. In order to denoise a large grayscale image, we cut it into possibly over¬ 
lapping patches of 32 x 32 pixels, which compose the multiple 1024-dimensional signals that we 
denoise simultaneously by learning the appropriate (structured) prior. We use four well-known 
images (see Figurej^, Barbara, Fingerprint, Lena (512 x 512 pixels each), and House (256 x 256 
pixels). Each signal Wk is formed by the wavelet coefficients of one 32 x 32 patch. For each of the 
K — 961 tasks (841 for House) we form by adding Gaussian noise of variance cr^ = 400 along 
each dimension. As in the previous section, we examine the performance of three instances of our 
model: the model with a uniform factorized sparse prior (LASSO-like), a non-uniform factorized 
sparse prior (W.LASSO-like), the structured norms on all descending (equivalently, ascending) 
paths in the rooted tree (Structured), and the structured norms on groups that we discover in the 
process of learning, with 2 active set updates (Structured(AS)). We consider a predefined range of 
6 values for the regularization hyperparameter (3, and 3 values (0.5,1.1,1.5) for the shape param¬ 
eter a of Student’s t. We compare the behavior of our methods with that of existing algorithms 
based on sparsity-inducing norms, which are not designed to learn group weights from data. From 
the family of such approaches, we choose the “Tree-^2” structured norm proposed by |Jenatton et 
al. (2011b), and the classical LASSO (Tibshirani 1994) on the wavelet coefficients. (We would 
like to stress here that “Tree-^2” does need group weights to be specified, but does not provide a 
systematic way to learn them. They are usually set by introducing a group-weighting parameter a 
so that a‘^ is the weight of all groups at depth d in the tree, and then optimizing a over a predefined 
range of values using cross-validation.) We run these methods on each set of small images with 
the regularization parameter A and the group-weighting parameter a (only for Tree-£2) varying 
over predefined ranges of 75 and 7 values respectively, and report the smallest error. To train the 
LASSO and learn with the Tree-£2 norm, we use the “proximal” toolbox of the software package 
SPAMS (Jenatton et al. 2011b). 


Results. Tableshows the best performance in terms of the mean squared error of each method 
on each image (which corresponds to a set of K small images). The values in the parentheses for 
our proposed methods indicate the value of j5 corresponding to the minimal error. The performance 
of our proposed methods with respect to the shape parameter a is systematically slightly better 
for larger a, and all reported results correspond to a = 1.5. According to these results, (a) the 
performance of a given value of /3 in signal recovery indeed seems to be stable across images (note 
that we have also observed that the performance on a given image is robust to small changes 
of the value of the hyperparameter); (b) the fact that the LASSO and our LASSO-like model 
are systematically outperformed by models that weight each variable confirms the intuition that 
learning how to weight individual variables should boost the estimation quality; (c) it seems that 
learning a prior on joint relevances of variables can lead to improved performance, as shown in the 
column corresponding to Fingerprint, although this is not always the case: on House and Lena, 
the performance of methods that learn group relevances is not significantly different from that 
of Tree-.^2, and in the case of Barbara they perform worse. Inspecting the relevances of different 
groups (paths in the wavelet tree) learned by Structured, we see that the groups explaining the 
bulk of the variance are overlapping groups of 2, 3, or 4 elements, mostly descending from the 
roots of the three quad-trees. In contrast, the relevant groups selected by Structured(AS) tend 
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Barbara 

House 

Fingerprint 

Lena 

LASSO-like 

179.0±4.6 (0.001) 

107.5T2.6 (0.001) 

247.5±1.7 (0.005) 

110.3T2.8 (0.001) 

W.LASSO-like 

163.3T5.1 (0) 

93.7±2.6 (0) 

195.0T1.8 (0.0001) 

89.5T3.2 (0) 

Structured 

164.8±5.3 (0) 

95.3T2.9 (0) 

193.6±1.8 (0.0005) 

90.3±3.5 (0) 

Structured(AS) 

163.1T5.0 (0.0001) 

92.9T2.3 (0.0001) 

194.9T1.8 (0.001) 

89.5T2.8 (0.0001) 

Tree -^2 

155.3±6.4 

93.3T3.8 

214.9±2.4 

88.7±3.7 

LASSO 

176.7T6.4 

102.1T3.6 

250.0T2.2 

106.6T3.9 


Table 2: Squared error averaged over the images with 95%-confidence error bars for each combina¬ 
tion of data generation and learning models. The usage of boldface indicates that the corresponding 
method significantly outperforms the others, as measured using a t-test at the level 0.05. (Each 
number is divided by 1000 for readability.) 


to consist of one to three roots of the three wavelet quad-trees and one or two wavelets of higher 
frequency, suggesting that paths in the wavelet tree may not always be the most natural groups in 
this problem. At last, let us stress that while “Tree-£ 2 ” is applicable in problems where variables 
can be structured in a tree given in advance, our proposed approach applies to any known or 
unknown group structure. 

The Matlab code used in our experiments is available at http; //cbio. ensmp. f r/~nshervashidze/ 
code/LLSS 


5 Conclusions and Future Work 


In this paper, we have proposed a flexible and general probabilistic model and an associated infer¬ 
ence scheme for automatically learning the weights of possibly overlapping groups in the context 
of structured sparse multi-task linear regression. We have shown that the classical variational 
inference scheme is not well adapted for learning with this model, and have proposed a regulariza¬ 
tion method that closes this gap. This has allowed us to investigate the effect of learning group 
weights in denoising problems, leading to the conclusion that learning penalties can significantly 
improve prediction quality, as well as the interpretability of the models, in this context. We have 
furthermore devised an active-set procedure that makes the inference with our model scalable to 
settings with large P and a large number of potential groups in A. 

In our future work we may consider different likelihood models to handle settings different 
from linear regression, such as binary classification. Learning group relevances for classification is 
indeed crucial, e.g., in the context of genome-wide association studies with binary phenotypes in 
computational biology, or for image segmentation in computer vision. 

In the appendix we provide details on the derivation of the variational inference scheme for our 


model (briefly introduced in Section 2.31 and discuss efficient ways of implementing the closed-form 
updates ©• 


A Variational inference for the super-Gaussian structured 
sparse prior 


In this appendix, we derive step by step the variational updates given in Section 2.3. 

We first recall our model: We assume that VA G A, v\ has an isotropic density with inverse scale 
parameter f{A) 

pivWfiA)) = g^(||u^|| 2 /(A)l/ 2 )/(A)l^l/^ ^revisited) 

where qA is a super-Gaussian distribution, that is, the logarithm of qA{u) is convex in and 
non-increasing (Palmer et al. 2006). It therefore admits a representation of the following form by 
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convex conjugacy 


^Og qA{u) = sup-^ - (j)A{s), 0 revisited) 

s>0 ' 

where (fiAis) is convex in 1/s. Note that the expression under the supremum in Q has a unique 
maximizer. In this work we only consider qa for which this maximizer has an analytical simple 
form. From ([^ and Q, we get the following variational representation for p(v^|/(A)): 


Lil 


piv%\fiA)) = f{A)^ 


sup e 
Cl>o 


-- ’PaICa) 


= f(A) 2 sup 
Cl>o 


k , m 




([^ revisited) 


fiA). 




Finally, as v\ are assumed independent, 


P({^^AUe^l/) = n Pi^AlfiA)). 

AeA 


Our goal is to infer the set function / from data by maximizing the type II log-likelihood, 

K 

Y.\ogp{y>^\f). 

fc=i 

We tackle the problem using variational inference and consider the following lower bound on 
logp(y^|/) (obtained by combining the density of (|^, the variational representation of the prior 
density on v\ Q, and taking into account the independence of {v\}a^a)^ for sets A S ^ C 2^ 
and for each regression task k, where we use the following notation: 


• is the concatenation of all elements indexed by elements oi A in A A, 

• is a square diagonal matrix of dimension diagonal consists of replicated 

|yl| times, for each A € A. 

• is a square diagonal matrix of dimension '^a&a I^I- diagonal consists of f{A), repli¬ 
cated |A| times, for each A^ A. 

• M is a matrix of dimension P x I^I ensures = Mv^. 
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logp(/l/) 

= log f p(/|M}ag^) n Pi'^A\fiA))dVA 


AeA 


log f Af{y^\X'^Mv'',a'^‘^I) sup /(A)' = ' (^'a| 0, 4^/) (27r-|^] " e ]J dv 

Jr- agaC1>o ^ ^ fiA) f{A)J i 

'°^l n/(A)^^ („>, jx,) ( 2 ,^) n 

sup log f n /(^)^-^(ul|0,-^/)(27r-^)^e-^-(^i) n 

> 0 ,-AG-A «/]R^ _AGw 4 . J \ ) J \ ) 

sup I log [ AA(/|X'=M^;^a'="/) n -^felO, ^/) [] 

>0,AgA ^ JiA) ) . , 


> 


Ci>o 


4 


C 1 >o.agA 


AgA 


AgA 


AgA 


sup (log / AA(y'=|X'=M^;^c^'=4)AA(^;'=|0,Z*^F-4d^;'^ 
>o.aga Jr- ^ ^ 


C:^>o.A 6 A 


+ E [V iog/(^) + V log ( 2 ^ 7 ^) - } 

AgA 

sup I logAA(/|0, X^MZ’^F-^M^X’^^ + ct^/) 

•sn A /- y1 ^ 


C1>o,agA 


AgA 
1 I-T 


-1 ^ 1 


sup \--y'^^(x'^MZ’=F-^M^X'^\a'^l) / - ^ logdet (x'^MZ'^F-^M^X'^Aa'^l) 

indc 2 | 12 \ / 2 \ / 


Q>0.AgA 


N 

Y 


log(27r)+ ^ [i^log/(A) + ^log(27r-^) - <^a(Ca)] }• 


41 


4 


AgA 


fiA)^ 


Thus, we need to minimize the following overall bound with respect to / and Ca f®'' all A G A and 
kG{l,...,K}: 

-11 

- ^ I - -4^ (X'^MZ'^F-^M^X'^^ + 4 - - logdet {x'^MZ’^F-^X’^^ + 4/) 

k^l 

+ E ^ log/4) + ~ ^ log(27r) + ^ logdet(Z"f-^) - ^ MC\)]- 

AgA AgA 

([^ revisited) 

In its form given by the bound is difficult to optimize. However, we can recognize parts of it 
as minima of convex functions, which will allow us to design an iterative algorithm with analytic 
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updates, finding a local minimum. In particular, it is not difficult to show that 


i logdet + cr^l) - ^ logdet(Z'=F-i) (matrix determinant lemma) 

= i logdet X'^M + ^ log(cr^) 

= inf - Tr{(M^X^^X^M + cr^j^Z'=~^)A^}-- logdet A'^-|A| + ^ log(cr2), 


which, with a change of variables = cr^A^, is written as 


i”h 5 ^ Tr{(MTx‘^.>f‘M)) + 1 2 _ 1 logdet e‘ - 1 |yl| + E l„g,o^,, 


E '':^0 2 cr 2 
and that 

1 I.T 


AeA 


AeA 


./^pZk 


(x'^MZ’^F-^M^X'^^ Fa'^l) / = inf ^||/- X'=Mu'=||^ + 
z \ / Zg^ 

Thus, from ([^, our optimization problem becomes 

^\\y^ - + ^^(W^aWI + ^^^^a) 

■'">0 «'= t zcr^ 2 •“. C . 

1 /V^ 1 T 1 

- - log det ^ log(cT2) + — log(27r) + ^ Tr X^ ^ E I^I 

+ H [“ -^log27r + ())A(CA) - ^l^|log/(A)j|, 


k=l 


AeA 


and the updates are 


AeA 

E'= = a^(M ' ' X'^M + a^FZ^~"y 

v'^ = {M^X'^^ X^M + <7‘^FZ'^~^)-^M^X'^^y^ 

Ca = argmin(/)A(2:) + + TrE^^) 

2 >o 2 z 


([To] revisited) 


a 2 = 


ELi { 11 / - A :'= Mw '=||2 + TrM ^ X '^ X '= ME '=} 




(19) 


K 

f{A) = argmin^^ ^(llu^ll^ +TrE^^) - A:-|A|loga; 

“ fc=i 

K\A\ 


Eti^(lKlli + TrE)iJ 


A.l Regularized version 

As we empirically show in Section 3.1, the variational approximation scheme from above tends to 
overestimate the variance of the prior distribution (i.e., underestimate the inverse scale parameter 
f{A)) when this variance is smaller than cr^, the noise variance. This is undesirable, as we would 
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like f{A) to go to infinity for irrelevant subsets of variables. To circumvent this problem, we use 
an improper prior of the form 

pif{A)) oc f{Af 

to encourage f{A) to go to infinity when the variance of p{va) is smaller than cr^. Consequently, 
the term —/31og/(y4) is added to the objetive function ( [Io| , and the only update that changes is 
the update for f{A): 


f{A) = argmin^^ ^(||n^||2 +TrE^^) - + P)\ogx 

k04+I3) 

5 Ylk=i ^(II^^aIII + Tr 


( 20 ) 


A.2 Faster updates 


The update equations (19) involve the inversion of the I^I ^X)agA I^I matrix X’^M+ 

a^FZ^ In fact, using the matrix inversion lemma, we can avoid performing this expensive in¬ 
version by rewriting the updates so that we only have to invert a P x P or an x matrix 
instead. 


Before we write down the modified updates, let us introduce some additional shorthand notation: 


• is the sum X)agA ^A S denotes the indicator vector for the index 

set A; 

• is a square diagonal matrix with = 1,... ,P; Put differently, = 

• Fl^ is a square diagonal matrix corresponding to Z^F~^. 




Yk 

C F ^ -I -I 


-h cr" 


C^^X^ + (P-E'^ 


-Ix-ln 


J ij 


( 21 ) 


M\\2 - fPe'^ 




Nk ^ matrix inversion. 

Efc = Tjfe _ {x’^E'^X'^^ + Piy^X^MH^ 

v'^ = X^^ {X^E'^X'^^ +Pl)~^y'^. 


( 22 ) 
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Special case of no design (signal denoising). When X* = I, the computations become 
considerably simpler. Note that in this case = P and the matrix + cr^S^ ^ is diagonal, 

so the cost of its inversion is 0{P) instead of 0{P^). In fact, we do not even need to form the 
diagonal matrix, as we do not need to explicitly use and in the updates. The updates can 
be rewritten as follows: 


TrS^^ = |A| 






Cl 


f{A) f^Af if f{Af (1 + ay if) 


E 


WvWii = ^ 
f{Ay 


Ei 


i^A Si 




^fe2 Lx ^2/^k 


cl = aigmm(l)A{z) + ^ {\\va \\1 + Tt 

z>0 z Z 

k04+(3) 


(23) 


fA)- 1 K 1 


sEtiTYdKlIi + TrElJ 


(The updates for f{A) and Cl remain unchanged.) 

In this case the computation of w^,k € {I,...,if}, and of the value of the objective are also 
simplified. To obtain compute each component of ul as 


[v% 


Cl 1 yf 
f(A)ifl + ayif 


2 if * € A ,0 otherwise, 

fiA) if + (t2 


and the objective as 


inf inf inf 
C '=>0 


k=l A&A 


4 E(||ul ||2 + TrSlJ 
Cl 


+ jEiog(cT^+cl) + jEi^|iog'^^^^ 


AeA 


/■k 


i=l Ae.4 


E - ■^iog27r + ())A(Cl) - ^l^|log/(.4) |. 


AeA 


Here we have used 

logdet = logdet[CT 2 (M^M + yFZ’^~^)-y 

= \A\ logy - logdet{M^M + yFZ'^~^) 

AeA 

= Y \A\logy - det{yi + M^Z'^F-^M)det{FZ^~^)] 

AeA 

p 

= piogy-Y^og{y+if)-Y i^ii°g 
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and 




= TrS'= - Tr 


a'^iy^MH'^M^ 


=i:f.‘-E 


frk- 

Si 


i=l 

P 


^4'=+a2 


= E 




+' 


Note that if we update the variables in the same order as in (23), then logdetS^, 
TrAf^ME*"' have to be computed before updating and f{A); This will ensure that 


and 

and 


Ik^llii respectively TrSj^^ and the two terms involving logdetS^ and TrM^MS^, are consis¬ 
tent, that is, they correspond to the same value of v\,A G A, respectively E^. 
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